
    Does dynamic and speculative parallelization enable advanced parallelizing and optimizing code transformations?

    Thread-Level Speculation (TLS) is a dynamic, automatic parallelization strategy for handling codes that cannot be parallelized at compile time because too little information can be extracted from the source code. However, the proposed TLS systems are strongly limited in the kinds of parallelization they can apply to the original sequential code, and consequently they often yield poor performance. In this paper, we identify the main reasons for these limits and show that, in some cases, a TLS system can handle more advanced parallelizing transformations. In particular, we show that codes characterized by phases whose memory behavior can be modeled by linear functions can take advantage of a dynamic use of the polytope model.
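
    As a rough illustration of the linearity test such a system relies on (a minimal sketch with invented names, not the paper's actual code), one can check whether the addresses touched by a memory reference fit an affine function of the loop counter; when they do, the polytope model can reason about the access pattern and a speculative transformation becomes possible:

        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // One profiled access: the loop counter value and the address touched.
        struct Sample { long iter; std::uintptr_t addr; };

        // Check whether the samples fit addr = base + stride * iter; if so,
        // the reference behaves linearly and the polytope model applies.
        bool fits_linear(const std::vector<Sample>& s, long& base, long& stride) {
            if (s.size() < 3 || s[1].iter == s[0].iter) return false;
            stride = (long)(s[1].addr - s[0].addr) / (s[1].iter - s[0].iter);
            base   = (long)s[0].addr - stride * s[0].iter;
            for (const Sample& x : s)
                if ((long)x.addr != base + stride * x.iter)
                    return false;              // non-linear phase: stay sequential
            return true;
        }

        int main() {
            std::vector<Sample> trace;
            for (long i = 0; i < 8; ++i)
                trace.push_back({i, 0x1000 + 8 * (std::uintptr_t)i});
            long base, stride;
            if (fits_linear(trace, base, stride))
                std::printf("linear: base=%ld stride=%ld\n", base, stride);
        }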

    CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

    Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out of order. To detect order violations, the load queue (LQ) holds all in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading (SMT) processor, stores also search the LQ when writing to the cache. LQ searches entail considerable energy consumption, and the processor stalls when the LQ is full or its ports are busy; the LQ is therefore a critical structure for both energy and performance. In this work, we observe that use of the LQ can be dramatically optimized under the guarantees of the data-race-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows removing DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core 2-way SMT processor with CELLO avoids almost all conservative LQ searches and significantly reduces LQ occupancy. CELLO (i) reduces the LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and (ii) allows shrinking the LQ from 192 to only 80 entries, reducing the LQ energy expenditure by as much as 69% while performing on par with a mainstream LQ implementation.
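
    CELLO itself is a hardware mechanism, but the key decision can be pictured in software. The following conceptual sketch (invented names; not the authors' implementation) models the per-entry effect of the compiler-provided DRF bit on an invalidation-triggered LQ search:

        #include <cstdint>
        #include <vector>

        // Conceptual in-flight load queue entry; `drf` is the compiler hint
        // that the load lies in a data-race-free region.
        struct LQEntry { std::uintptr_t addr; bool drf; };

        // On an invalidation of cache line `line`, a TSO core searches the LQ
        // for speculative loads to that line. Under the DRF guarantee, marked
        // loads cannot participate in a race, so searching them is skipped.
        bool search_lq(const std::vector<LQEntry>& lq, std::uintptr_t line) {
            for (const LQEntry& e : lq) {
                if (e.drf) continue;                 // CELLO: safe to skip
                if ((e.addr >> 6) == (line >> 6))    // same 64-byte cache line
                    return true;                     // possible TSO violation
            }
            return false;
        }

        int main() {
            std::vector<LQEntry> lq = {{0x1000, true}, {0x1040, false}};
            return search_lq(lq, 0x1040) ? 1 : 0;    // finds the non-DRF load
        }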

    Handling Multi-Versioning in LLVM: Code Tracking and Cloning

    Instrumentation by sampling, adaptive computing, and dynamic optimization can be implemented efficiently using multiple versions of a code region. Ideally, compilers should generate such multiple versions automatically. In this work we discuss the problem of multi-versioning when each version requires a different intermediate representation. We expose the limits of current compilers in this respect and provide our solutions to overcome them, using the LLVM compiler as our research platform. The paper focuses on three main aspects: tracking code in LLVM IR, cloning, and communication between low-level and high-level representations. Aiming at performance and minimal impact on the behavior of the original code, we describe our strategies for guiding the interaction between the newly inserted code and the optimization passes, from annotating code with metadata to inlining assembly code in LLVM IR. Our target is code instrumentation and optimization, with a particular interest in loops. We build one version in x86_64 assembly code to acquire low-level information, and multiple versions in LLVM IR for performing high-level code transformations. The selection mechanism consists of callbacks to a generic runtime system. Preliminary results on the SPEC CPU 2006 and Pointer Intensive benchmark suites show that our framework has negligible overhead in most cases when instrumenting the most time-consuming loop nests.
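
    As a rough sketch of the selection mechanism described above (all names invented for illustration, not the paper's actual API), the generated multi-versioned region can be pictured as a dispatch through a runtime callback, where each clone may live in a different representation:

        #include <cstddef>
        #include <cstdio>

        // Hypothetical runtime callback: decides which clone of a region to
        // execute next, based on profiles collected so far.
        enum Version { INSTRUMENTED, OPTIMIZED, ORIGINAL };
        static Version select_version(int /*region_id*/) { return OPTIMIZED; }

        // Shape of the dispatch the compiler emits around a multi-versioned
        // loop: one clone instruments, one is transformed, one is untouched.
        static void region_sum(const double* a, std::size_t n, double* out) {
            double s = 0.0;
            switch (select_version(/*region_id=*/1)) {
            case INSTRUMENTED:               // clone with profiling callbacks
                for (std::size_t i = 0; i < n; ++i) { s += a[i]; /* record &a[i] */ }
                break;
            case OPTIMIZED:                  // clone transformed at IR level
                for (std::size_t i = 0; i < n; ++i) s += a[i];
                break;
            case ORIGINAL:                   // unmodified sequential code
                for (std::size_t i = 0; i < n; ++i) s += a[i];
                break;
            }
            *out = s;
        }

        int main() {
            double a[4] = {1, 2, 3, 4}, s = 0;
            region_sum(a, 4, &s);
            std::printf("sum = %g\n", s);
        }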

    VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs

    In this paper, we present VMAD (Virtual Machine for Advanced Dynamic analysis), a virtual machine enabling efficient implementation of advanced program profiling and analysis. VMAD is organized as a sequence of basic operations, with external modules associated with specific profiling strategies loaded dynamically when required. The program binaries handled by VMAD are instrumented beforehand at compile time to include the necessary data, instrumentation instructions, and callbacks to the VM. Dynamic information, such as the memory locations of launched modules, is patched into the binary file at startup. The LLVM compiler has been extended to automatically instrument programs according to both VMAD and the selected profiling strategies. VMAD's potential is illustrated with a profiling strategy dedicated to loop nests: it collects all memory addresses accessed during a selected number of successive iterations of each loop, and an analysis process then tries to interpolate the addresses successively accessed through each memory reference as linear functions of virtual loop indices. This profiling strategy has been run on some of the SPEC 2006 and Pointer Intensive benchmark suites, showing very low time overhead in most cases.
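
    A toy sketch of the instrumentation side (hypothetical names, not VMAD's real interface): the instrumented clone reports each accessed address to the runtime for a bounded number of iterations, after which profiling switches itself off to keep overhead low:

        #include <cstdint>
        #include <cstdio>

        // Hypothetical per-reference profile buffer: addresses seen during
        // the first PROFILED_ITERS iterations of the enclosing loop.
        constexpr int PROFILED_ITERS = 16;
        static std::uintptr_t trace[PROFILED_ITERS];
        static int seen = 0;

        // Callback inserted by the instrumentation pass before a memory
        // reference. Returns false once enough samples exist, so the caller
        // can switch back to the uninstrumented version of the loop.
        extern "C" bool vmad_record(std::uintptr_t addr) {
            if (seen < PROFILED_ITERS) trace[seen++] = addr;
            return seen < PROFILED_ITERS;
        }

        int main() {
            int a[64];
            for (int i = 0; i < 64; ++i)
                if (!vmad_record((std::uintptr_t)&a[i])) break;  // window over
            std::printf("collected %d samples\n", seen);
        }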

    Adapting the Polyhedral Model to Dynamic and Speculative Parallelization

    In this thesis, we present the design and implementation of a Thread-Level Speculation (TLS) framework called VMAD, for "Virtual Machine for Advanced Dynamic analysis and transformation", whose main feature is the ability to speculatively parallelize a sequential loop nest in various ways by reordering its iterations. The transformation to apply is selected at runtime, with the goals of minimizing the number of rollbacks and maximizing performance. We perform code transformations by applying the polyhedral model, which we adapted to speculative parallelization at runtime. For this purpose, we build in advance a parallel code pattern that is patched by our runtime system according to profiling information collected on samples of the execution. Adaptability is ensured by considering code slices of different sizes, executed successively, each parallelized differently or run sequentially depending on the observed memory access behavior. We show on several benchmarks that our framework yields good performance on codes that could not be handled efficiently by previously proposed TLS systems.
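
    A schematic of the slice-by-slice strategy the abstract describes (names and stubs invented for illustration; the real system patches a compiled parallel code pattern rather than calling functions):

        #include <cstddef>
        #include <cstdio>

        enum Strategy { SEQUENTIAL, PARALLEL };

        // Hypothetical hooks, trivially stubbed here. In the real framework,
        // a slice runs through the patched parallel pattern and the predicted
        // linear access functions are verified during execution.
        static Strategy choose(std::size_t) { return PARALLEL; }
        static bool run_speculative(Strategy, std::size_t b, std::size_t e) {
            std::printf("parallel slice [%zu,%zu)\n", b, e);
            return true;                    // pretend speculation always holds
        }
        static void run_sequential(std::size_t b, std::size_t e) {
            std::printf("sequential slice [%zu,%zu)\n", b, e);
        }

        // The iteration space is executed slice by slice; each slice may be
        // parallelized differently, and a slice whose speculation fails is
        // rolled back and re-executed sequentially.
        static void execute(std::size_t n_iters, std::size_t chunk) {
            for (std::size_t b = 0; b < n_iters; b += chunk) {
                std::size_t e = b + chunk < n_iters ? b + chunk : n_iters;
                Strategy s = choose(b);
                if (s == SEQUENTIAL || !run_speculative(s, b, e))
                    run_sequential(b, e);   // rollback path: redo the slice
            }
        }

        int main() { execute(100, 32); }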

    A Cost-Effective Entangling Prefetcher for Instructions

    Prefetching instructions into the instruction cache is a fundamental technique for designing high-performance computers. Three key properties must be considered when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is essential: bringing instructions too early increases the risk of their being evicted from the cache before use, while requesting them too late can lead to instructions arriving after they are demanded. Coverage is important to reduce the number of instruction cache misses, and accuracy ensures that the prefetcher neither pollutes the cache nor interacts negatively with other hardware mechanisms. This paper presents the Entangling Prefetcher for Instructions, which entangles instructions to maximize timeliness. The prefetcher works by finding which instruction should trigger the prefetch of a subsequent instruction, accounting for the latency of each cache miss, and is carefully tuned to account for both coverage and accuracy. Our evaluation shows that with 40 KB of storage, Entangling increases performance by up to 23%, outperforming state-of-the-art prefetchers.
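
    A conceptual sketch of the entangling idea (heavily simplified; the paper's actual tables and policies differ): each instruction-cache miss at a destination is paired with the instruction fetched roughly one miss latency earlier, so that the next occurrence of that source triggers a just-in-time prefetch:

        #include <cstdint>
        #include <cstdio>
        #include <deque>
        #include <unordered_map>

        struct FetchEvent { std::uint64_t pc, cycle; };

        class EntanglingTable {
            std::deque<FetchEvent> history;                       // recent fetches
            std::unordered_map<std::uint64_t, std::uint64_t> entangled; // src -> dst
        public:
            void on_fetch(std::uint64_t pc, std::uint64_t cycle) {
                auto it = entangled.find(pc);
                if (it != entangled.end())
                    prefetch(it->second);          // src seen: prefetch dst in time
                history.push_back({pc, cycle});
                if (history.size() > 256) history.pop_front();
            }
            void on_miss(std::uint64_t dst, std::uint64_t cycle, std::uint64_t latency) {
                // Walk back to the fetch that happened about `latency` cycles
                // before the miss, and entangle it with the missing block.
                for (auto it = history.rbegin(); it != history.rend(); ++it)
                    if (cycle - it->cycle >= latency) {
                        entangled[it->pc] = dst;   // entangle src -> dst
                        break;
                    }
            }
            static void prefetch(std::uint64_t pc) {
                std::printf("prefetch %#llx\n", (unsigned long long)pc);
            }
        };

        int main() {
            EntanglingTable t;
            t.on_fetch(0x400000, 100);
            t.on_miss(0x400100, 130, 30);   // entangle 0x400000 -> 0x400100
            t.on_fetch(0x400000, 500);      // now triggers a timely prefetch
        }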

    The Entangling Instruction Prefetcher

    Prefetching instructions into the instruction cache is a fundamental technique for designing high-performance computers. To achieve maximum performance, three key properties must be considered in designing an efficient and effective prefetcher: (1) timeliness, (2) coverage, and (3) accuracy. Timeliness is an essential property: bringing instructions too early increases the risk of their being evicted from the cache before use, while requesting them too late can lead to instructions arriving past their designated execution time. Coverage is essential to effectively reduce the number of instruction cache misses (there is enough prefetching), and accuracy ensures that the prefetcher neither pollutes the cache nor interacts negatively with other hardware mechanisms (there is not too much prefetching). This paper presents the Entangling Prefetcher for Instructions (EPI), which entangles instructions to provide timeliness. The prefetcher works by finding which instruction should trigger the prefetch of a subsequent instruction, accounting for the latency of each prefetch, and is carefully tuned to account for both coverage and accuracy. Our evaluation shows that EPI increases performance by 29.5% on average, with a coverage of 95.6% and an accuracy of 77.0%.